{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Time series aggregation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Time series aggregation involves summarizing or transforming data over specific time intervals. Two of the most common use cases are\n",
    "\n",
    "+ Aggregating data from one frequency to another. For example, converting hourly data to daily data. \n",
    "\n",
    "+ Aggregating data across a sliding window. For example, calculating a rolling average over the last 7 days.\n",
    "\n",
    "These aggregations not only greatly reduce the total volume of data, but also help you find interesting features for your model faster.\n",
    "\n",
    "**Pandas** provides an easy and efficient way to aggregate data from time series. This document shows how to use the `resampling` and `rolling` methods to aggregate data, with special emphasis on how to avoid data leakage, which is a common mistake when aggregating time series."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Libraries and data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Libraries\n",
    "# ==============================================================================\n",
    "import pandas as pd\n",
    "from skforecast.datasets import fetch_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭──────────────────────────── <span style=\"font-weight: bold\">vic_electricity</span> ─────────────────────────────╮\n",
       "│ <span style=\"font-weight: bold\">Description:</span>                                                             │\n",
       "│ Half-hourly electricity demand for Victoria, Australia                   │\n",
       "│                                                                          │\n",
       "│ <span style=\"font-weight: bold\">Source:</span>                                                                  │\n",
       "│ O'Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022).tsibbledata: Diverse │\n",
       "│ Datasets for 'tsibble'. https://tsibbledata.tidyverts.org/,              │\n",
       "│ https://github.com/tidyverts/tsibbledata/.                               │\n",
       "│ https://tsibbledata.tidyverts.org/reference/vic_elec.html                │\n",
       "│                                                                          │\n",
       "│ <span style=\"font-weight: bold\">URL:</span>                                                                     │\n",
       "│ https://raw.githubusercontent.com/skforecast/skforecast-                 │\n",
       "│ datasets/main/data/vic_electricity.csv                                   │\n",
       "│                                                                          │\n",
       "│ <span style=\"font-weight: bold\">Shape:</span> 52608 rows x 4 columns                                            │\n",
       "╰──────────────────────────────────────────────────────────────────────────╯\n",
       "</pre>\n"
      ],
      "text/plain": [
       "╭──────────────────────────── \u001b[1mvic_electricity\u001b[0m ─────────────────────────────╮\n",
       "│ \u001b[1mDescription:\u001b[0m                                                             │\n",
       "│ Half-hourly electricity demand for Victoria, Australia                   │\n",
       "│                                                                          │\n",
       "│ \u001b[1mSource:\u001b[0m                                                                  │\n",
       "│ O'Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022).tsibbledata: Diverse │\n",
       "│ Datasets for 'tsibble'. https://tsibbledata.tidyverts.org/,              │\n",
       "│ https://github.com/tidyverts/tsibbledata/.                               │\n",
       "│ https://tsibbledata.tidyverts.org/reference/vic_elec.html                │\n",
       "│                                                                          │\n",
       "│ \u001b[1mURL:\u001b[0m                                                                     │\n",
       "│ https://raw.githubusercontent.com/skforecast/skforecast-                 │\n",
       "│ datasets/main/data/vic_electricity.csv                                   │\n",
       "│                                                                          │\n",
       "│ \u001b[1mShape:\u001b[0m 52608 rows x 4 columns                                            │\n",
       "╰──────────────────────────────────────────────────────────────────────────╯\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Demand</th>\n",
       "      <th>Temperature</th>\n",
       "      <th>Holiday</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Time</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2011-12-31 13:00:00</th>\n",
       "      <td>4382.825174</td>\n",
       "      <td>21.40</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 13:30:00</th>\n",
       "      <td>4263.365526</td>\n",
       "      <td>21.05</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 14:00:00</th>\n",
       "      <td>4048.966046</td>\n",
       "      <td>20.70</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 14:30:00</th>\n",
       "      <td>3877.563330</td>\n",
       "      <td>20.55</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 15:00:00</th>\n",
       "      <td>4036.229746</td>\n",
       "      <td>20.40</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          Demand  Temperature  Holiday\n",
       "Time                                                  \n",
       "2011-12-31 13:00:00  4382.825174        21.40     True\n",
       "2011-12-31 13:30:00  4263.365526        21.05     True\n",
       "2011-12-31 14:00:00  4048.966046        20.70     True\n",
       "2011-12-31 14:30:00  3877.563330        20.55     True\n",
       "2011-12-31 15:00:00  4036.229746        20.40     True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Download data\n",
    "# ==============================================================================\n",
    "data = fetch_dataset(name='vic_electricity')\n",
    "data = data[['Demand', 'Temperature', 'Holiday']]\n",
    "data.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset contains the electricity demand in Victoria (Australia) at half-hourly frequency."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Frequency: <30 * Minutes>\n"
     ]
    }
   ],
   "source": [
    "# Index Frequency\n",
    "# ==============================================================================\n",
    "print(f\"Frequency: {data.index.freq}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Change frequency (resample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To change the frequency of a time series, use the `resample` method. This method allows you to specify a frequency and an aggregation function. It works similarly to the `groupby` method, but it works with time series indices.\n",
    "\n",
    "When aggregating data, it is very important to use the `closed` and `label` arguments correctly. This avoids introducing future information into the training (data leakage). \n",
    "\n",
    "+ The `closed` argument specifies whether the interval is closed on the left-side, right-side, both or neither. \n",
    "\n",
    "+ The `label` argument specifies whether the result is labeled with the beginning or the end of the interval. \n",
    "\n",
    "Suppose that values are available for 10:10, 10:30, 10:45, 11:00, 11:12, and 11:30. To obtain the hourly average, the value assigned to 11:00 must be calculated using the values for 10:10, 10:30, and 10:45; and the value assigned to 12:00 must be calculated using the value for 11:00, 11:12 and 11:30. The 11:00 average does not include the 11:00 point value because in reality the value is not available at that exact time.\n",
    "\n",
    "In this case, the correct arguments are `closed='left'` and `label='right'`.\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../img/diagram-resampling-datetime.png\" style=\"width: 500px;\">\n",
    "<br>\n",
    "<font size='2.5'> <i>Diagram of data aggregation using the resample method without including future information.</i></font>\n",
    "</p>\n",
    "\n",
    "For example, the code in the next cell converts the data from half-hourly to hourly frequency. Since there are multiple columns, an aggregation function must be specified for each column. In this case, the `sum` is calculated for the `Demand` column and the `average` is calculated for the rest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Demand</th>\n",
       "      <th>Temperature</th>\n",
       "      <th>Holiday</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Time</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2011-12-31 14:00:00</th>\n",
       "      <td>8646.190700</td>\n",
       "      <td>21.225</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 15:00:00</th>\n",
       "      <td>7926.529376</td>\n",
       "      <td>20.625</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 16:00:00</th>\n",
       "      <td>7901.826990</td>\n",
       "      <td>20.325</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 17:00:00</th>\n",
       "      <td>7255.721350</td>\n",
       "      <td>19.850</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 18:00:00</th>\n",
       "      <td>6792.503352</td>\n",
       "      <td>19.025</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 09:00:00</th>\n",
       "      <td>8139.251100</td>\n",
       "      <td>21.600</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 10:00:00</th>\n",
       "      <td>7818.461408</td>\n",
       "      <td>20.300</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 11:00:00</th>\n",
       "      <td>7801.201802</td>\n",
       "      <td>19.650</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 12:00:00</th>\n",
       "      <td>7516.472988</td>\n",
       "      <td>18.100</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 13:00:00</th>\n",
       "      <td>7571.301440</td>\n",
       "      <td>17.200</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>26304 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                          Demand  Temperature  Holiday\n",
       "Time                                                  \n",
       "2011-12-31 14:00:00  8646.190700       21.225      1.0\n",
       "2011-12-31 15:00:00  7926.529376       20.625      1.0\n",
       "2011-12-31 16:00:00  7901.826990       20.325      1.0\n",
       "2011-12-31 17:00:00  7255.721350       19.850      1.0\n",
       "2011-12-31 18:00:00  6792.503352       19.025      1.0\n",
       "...                          ...          ...      ...\n",
       "2014-12-31 09:00:00  8139.251100       21.600      0.0\n",
       "2014-12-31 10:00:00  7818.461408       20.300      0.0\n",
       "2014-12-31 11:00:00  7801.201802       19.650      0.0\n",
       "2014-12-31 12:00:00  7516.472988       18.100      0.0\n",
       "2014-12-31 13:00:00  7571.301440       17.200      0.0\n",
       "\n",
       "[26304 rows x 3 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Aggregate data from 30 minutes to 1 hour\n",
    "# ==============================================================================\n",
    "data = data.resample(rule='1h', closed='left', label ='right').agg({\n",
    "           'Demand': 'sum',\n",
    "           'Temperature': 'mean',\n",
    "           'Holiday': 'mean'\n",
    "       })\n",
    "\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rolling window aggregation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rolling window aggregation is used to calculate statistics over a sliding window of time. For example, the 24h rolling average is the average of the last 24 hours of data. As with the `resample` method, it is very important to use the `closed='left'` and `center=False` arguments correctly to avoid introducing future information into the training (data leakage)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Demand</th>\n",
       "      <th>Temperature</th>\n",
       "      <th>Holiday</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Time</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2011-12-31 14:00:00</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 15:00:00</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 16:00:00</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 17:00:00</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-12-31 18:00:00</th>\n",
       "      <td>7932.567104</td>\n",
       "      <td>20.50625</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 09:00:00</th>\n",
       "      <td>8490.517461</td>\n",
       "      <td>23.71250</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 10:00:00</th>\n",
       "      <td>8482.825404</td>\n",
       "      <td>23.41250</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 11:00:00</th>\n",
       "      <td>8314.896216</td>\n",
       "      <td>22.67500</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 12:00:00</th>\n",
       "      <td>8076.417548</td>\n",
       "      <td>21.30000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-12-31 13:00:00</th>\n",
       "      <td>7818.846825</td>\n",
       "      <td>19.91250</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>26304 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                          Demand  Temperature  Holiday\n",
       "Time                                                  \n",
       "2011-12-31 14:00:00          NaN          NaN      NaN\n",
       "2011-12-31 15:00:00          NaN          NaN      NaN\n",
       "2011-12-31 16:00:00          NaN          NaN      NaN\n",
       "2011-12-31 17:00:00          NaN          NaN      NaN\n",
       "2011-12-31 18:00:00  7932.567104     20.50625      1.0\n",
       "...                          ...          ...      ...\n",
       "2014-12-31 09:00:00  8490.517461     23.71250      0.0\n",
       "2014-12-31 10:00:00  8482.825404     23.41250      0.0\n",
       "2014-12-31 11:00:00  8314.896216     22.67500      0.0\n",
       "2014-12-31 12:00:00  8076.417548     21.30000      0.0\n",
       "2014-12-31 13:00:00  7818.846825     19.91250      0.0\n",
       "\n",
       "[26304 rows x 3 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Rolling mean for 4 hours\n",
    "# ==============================================================================\n",
    "data.rolling(window=4, min_periods=4, closed='left', center=False).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The average values for `2011-12-31 18:00:00` are calculated from the values of the previous 4 hours (from 14:00:00 to 17:00:00)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Demand         7932.567104\n",
       "Temperature      20.506250\n",
       "Holiday           1.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# '2011-12-31 18:00:00' mean\n",
    "# ==============================================================================\n",
    "data.iloc[0:4, :].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note\" name=\"html-admonition\" style=\"background: rgba(255,145,0,.1); padding-top: 0px; padding-bottom: 6px; border-radius: 8px; border-left: 8px solid #ff9100; border-color: #ff9100; padding-left: 10px; padding-right: 10px\">\n",
    "\n",
    "<p class=\"title\">\n",
    "    <i style=\"font-size: 18px; color:#ff9100; border-color: #ff1744;\"></i>\n",
    "    <b style=\"color: #ff9100;\"> <span style=\"color: #ff9100;\">&#9888;</span> Warning</b>\n",
    "</p>\n",
    "\n",
    "When transforming time series data, such as aggregating, it is very important to avoid data leakage, which means using information from the future to calculate the current value.\n",
    "\n",
    "</div>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "skforecast_py12",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
